Abstract — Statistical disclosure control (SDC) methods 
reconcile the need to release information to researchers with 
the need to protect the privacy of individual records. 
Microaggregation is an SDC method that protects data subjects 
by guaranteeing k-anonymity: records are partitioned into 
groups of size at least k and actual data values are replaced by 
the group means so that each record in the group is 
indistinguishable from at least k-1 other records. The goal is 
to create groups of similar records such that information loss 
due to data modification is minimized, where information 
loss is measured by the sum of squared deviations between 
the actual data values and their group means. Since optimal 
multivariate microaggregation is NP-hard, heuristics have 
been developed for microaggregation. It has been shown that 
for a given ordering of records, the optimal partition consistent 
with that ordering can be efficiently computed and some of 
the best existing microaggregation methods are based on this 
approach. This paper improves on previous heuristics by 
adapting tour construction and tour improvement heuristics 
for the traveling salesman problem (TSP) for 
microaggregation. Specifically, the Greedy heuristic and the 
Quick Boruvka heuristic are investigated for tour construction 
and the 2-opt, 3-opt, and Lin-Kernighan heuristics are used 
for tour improvements. Computational experiments using 
benchmark datasets indicate that our method results in lower 
information loss than extant microaggregation heuristics. 

Index Terms — Disclosure Control, Microaggregation, Privacy 
protection, Tour construction heuristics, Tour improvement 
heuristics, Shortest path. 

I. Introduction 

Many databases, such as health information systems and 
U.S. census data, contain information that is valuable to 
researchers for statistical purposes. At the same time, these 
databases contain private information about individuals. 
There is a tradeoff between providing information for the 
benefit of society, and restricting information for the benefit 
of individuals. Statistical disclosure control (SDC) is a set of 
techniques for providing access to aggregate information in 
statistical databases, while at the same time protecting the 
privacy of individual data subjects. See [1], [12], and [27] for 
good overviews of statistical disclosure control methods. 
Microaggregation is a popular SDC method that protects 
data subjects by guaranteeing k-anonymity (see [2], [23], and 
[26]). Under microaggregation, records are partitioned into 
groups of size at least k and actual data values are replaced 
by the group means so that each record in the group is 
indistinguishable from at least k-1 other records. Replacing 
actual data values with group means leads to information 
loss and renders the data less valuable to users. Optimal 
microaggregation aims to minimize information loss by 
grouping together records that are very similar, thereby 
maximizing homogeneity among the records of each group. 
The optimal microaggregation problem can be formally 
defined as follows: Given a data set with n records and p 
numerical attributes per record, partition the records into 
groups containing at least k records each, such that the sum 
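squared error (SSE) within groups is minimized. SSE is defined 
as 

$$\mathrm{SSE} = \sum_{i=1}^{g} \sum_{j=1}^{n_i} \lVert x_{ij} - \bar{x}_i \rVert^2,$$

where g is the number of groups, $n_i$ is the number of records 
in the i-th group, $x_{ij}$ is the j-th record of the i-th group, 
and $\bar{x}_i$ is the mean of the i-th group. 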
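The definition above can be illustrated with a minimal Python sketch. The function names (`centroid`, `sq_dist`, `microaggregate`) and the representation of groups as lists of record indices are assumptions made for illustration; they do not come from the paper, whose implementation was written in Java.

```python
from typing import List, Sequence, Tuple

Record = Sequence[float]


def centroid(records: List[Record]) -> List[float]:
    """Attribute-wise mean of a non-empty list of records."""
    n = len(records)
    return [sum(col) / n for col in zip(*records)]


def sq_dist(a: Record, b: Record) -> float:
    """Squared Euclidean distance between two records."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def microaggregate(records: List[Record],
                   groups: List[List[int]],
                   k: int) -> Tuple[List[List[float]], float]:
    """Replace every record by its group mean and return (masked, SSE).

    `groups` is a hypothetical representation: each group is a list of
    record indices, and together the groups must partition all records.
    """
    assigned = sorted(i for g in groups for i in g)
    if assigned != list(range(len(records))):
        raise ValueError("groups must partition the record indices")
    if any(len(g) < k for g in groups):
        raise ValueError("every group needs at least k records")

    masked: List[List[float]] = [[] for _ in records]
    sse = 0.0
    for g in groups:
        mean = centroid([records[i] for i in g])
        for i in g:
            masked[i] = mean                    # released value
            sse += sq_dist(records[i], mean)    # information lost
    return masked, sse
```

For six univariate records 1, 2, 3, 10, 11, 12 with k = 3, grouping the first three and the last three records replaces them with 2 and 11 respectively, and each group contributes 1 + 0 + 1 to the SSE.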

This problem has been shown to be NP-hard when p > 1 
[21]. Since microaggregation is extensively used for SDC, 
several heuristics have been proposed that lead to low 
information loss (see, e.g., [4], [6], [7], [9], [10], [11], [13], [16], 
[17], [18], [20], [24], and [25]). There is a need to develop new 
heuristics that perform better than those currently known, 
either by producing groups with lower information loss, or 
by achieving the same information loss at lower 
computational cost. Optimal univariate microaggregation (i.e., 
involving a single attribute) has been shown to be solvable 
in polynomial time by taking an ordered list of records and 
using a shortest-path algorithm to compute the lowest 
information loss k-partition for the given ordered list [14]. A 
k-partition is a set of groups such that every group contains 
at least k elements. Optimality is guaranteed, since univariate 
data can be strictly ordered. Practical applications, however, 
require microaggregation to be performed on records 
containing multiple attributes. While multivariate data cannot 
be strictly ordered, Domingo-Ferrer et al. [9] show that, for a 
given ordering of records, the best partition consistent with 
that ordering can be identified efficiently as a shortest path 
in a network using Hansen and Mukherjee's method [14]. 
Exploiting this idea, they develop the Multivariate Hansen- 
Mukherjee (MHM) algorithm and propose several heuristics 
for ordering the records. Empirical results indicate that MHM, 
when used with good ordering heuristics, outperforms extant 
microaggregation methods. This paper builds on the work of 
[9] by investigating new methods for ordering records as a 
first step in multivariate microaggregation. The techniques 
for ordering records are based on tour construction and 
improvement heuristics for the traveling salesman problem 
(TSP). Using previously studied benchmark datasets, our 
method is shown to outperform extant microaggregation 
heuristics. Section II discusses the MHM-based 
microaggregation method and record ordering heuristics 
proposed in [9]. Section III introduces our TSP-based tour 
construction and improvement heuristics. Section IV 
describes the computational experiments used to compare 
our method with extant microaggregation heuristics. Section 
V presents our results. 

II. MHM Microaggregation Heuristic 

A. The MHM Algorithm 

The MHM algorithm introduced in [9] involves 
constructing a graph based on an ordered list of records, and 
finding the shortest path in the graph. The arcs in the shortest 
path correspond to a partition of the records that is guaranteed 
to be the lowest cost partition consistent with the specified 
ordering. A partition is said to be consistent with an ordering 
if the following is true: for any three records x_i, x_j, and x_k 
where i < j < k, if x_i and x_k belong to the same group g, then x_j 
also belongs to group g. The graph is constructed as follows: 
Given an ordered set of n records, create n nodes labeled 1 to 
n. Additionally, create a node labeled 0. For each pair of 
nodes (i, j) where i+k <= j < i+2k, create an arc directed 
from node i to node j. The arc (i, j) corresponds to the group 
of records (i+1, ..., j), and its length is the within-group sum 
of squared errors 

$$L_{(i,j)} = \sum_{h=i+1}^{j} \lVert x_h - \bar{x}_{(i,j)} \rVert^2,$$

where $x_h$ is a multivariate record, and $\bar{x}_{(i,j)}$ is the centroid of 
the records in (i+1, j). Once the graph is constructed, a 
shortest-path algorithm such as Dijkstra's algorithm [8] is 
used to determine the shortest path from node 0 to node n. 
The length of the shortest path gives the information loss 
corresponding to the best partition consistent with the 
specified ordering. 
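The construction above can be sketched in Python as follows. This is a hedged illustration, not the implementation from [9]: because every arc points from a lower-numbered node to a higher-numbered one, the sketch relaxes nodes in increasing order (a standard shortest-path computation on an acyclic graph) instead of running Dijkstra's algorithm. The names `group_sse` and `mhm_partition` are hypothetical.

```python
from typing import List, Sequence, Tuple

Record = Sequence[float]


def group_sse(ordered: List[Record], lo: int, hi: int) -> float:
    """SSE of the slice ordered[lo:hi] around its own centroid.

    With 0-based slicing this is the group of arc (lo, hi), i.e. the
    records numbered lo+1 .. hi in the 1-based notation of the text.
    """
    block = ordered[lo:hi]
    n = len(block)
    mean = [sum(col) / n for col in zip(*block)]
    return sum((x - m) ** 2 for r in block for x, m in zip(r, mean))


def mhm_partition(ordered: List[Record], k: int) -> Tuple[float, List[Tuple[int, int]]]:
    """Best k-partition consistent with the given ordering.

    Returns (total SSE, list of (lo, hi) slices). Arcs (i, j) exist
    only for i + k <= j < i + 2k, as in the graph described above.
    """
    n = len(ordered)
    if k < 1 or n < k:
        raise ValueError("need at least k records and k >= 1")

    inf = float("inf")
    dist = [inf] * (n + 1)   # dist[j]: shortest path length from node 0 to j
    pred = [-1] * (n + 1)    # pred[j]: previous node on that path
    dist[0] = 0.0

    for j in range(k, n + 1):
        # Tails i of arcs into j satisfy j - 2k < i <= j - k.
        for i in range(max(0, j - 2 * k + 1), j - k + 1):
            if dist[i] == inf:
                continue
            cand = dist[i] + group_sse(ordered, i, j)
            if cand < dist[j]:
                dist[j], pred[j] = cand, i

    # Walk predecessors back from node n to recover the groups.
    groups: List[Tuple[int, int]] = []
    j = n
    while j > 0:
        groups.append((pred[j], j))
        j = pred[j]
    groups.reverse()
    return dist[n], groups
```

For the ordered univariate records 1, 2, 3, 10, 11, 12, 13 with k = 3, the only admissible splits are sizes (3, 4) and (4, 3); the sketch selects the first, whose groups have SSE 2 and 5.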

B. Existing Record Ordering Heuristics 

Four heuristics for ordering records have been proposed 
in [9]: nearest-point-next (NPN), maximum distance (MD- 
MHM), maximum distance to average vector (MDAV-MHM), 
and centroid-based fixed-size (CBFS-MHM). The NPN 
heuristic selects the first record by computing the record 
furthest away from the centroid of the entire dataset. The 
record closest to the initial record is designated as the second 
record. The third record is closest to the second record. 
This process continues until all of the records have been 
added to the tour. The NPN heuristic runs in O(n^2) time, 
where n is the number of records in the dataset. The MD- 
MHM heuristic works as follows: Find the two records r and 
s that are furthest from one another in the dataset, using 
Euclidean distance. Find the k-1 closest records to r and 
form a group. Find the k-1 closest records to s and form a 
group. Repeat this process, using the remaining records that 
have not been assigned to a group, until all the remaining records 
are assigned. Stitch the groups together by applying NPN to 
their centroids. The MD-MHM heuristic runs in O(n^3) time. 
The MDAV-MHM heuristic is very similar to MD, but 
considers distances of records to group centroids, rather than 
distances to other records. The MDAV heuristic runs in O(n^2) 
time. The CBFS-MHM heuristic designates as the first record the 
record r that is furthest from the centroid. The k-1 closest 
records to r are added to its group. This process is repeated 
using the remaining records until all the records are assigned. 
The complexity of CBFS is also O(n^2). 
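As an illustration of the simplest of these orderings, the following is a minimal Python sketch of NPN as described above. It assumes that "closest" means the closest record not yet placed in the ordering, and it breaks distance ties by record index for determinism; both points, and the name `npn_order`, are assumptions rather than details given in [9].

```python
from typing import List, Sequence

Record = Sequence[float]


def sq_dist(a: Record, b: Record) -> float:
    """Squared Euclidean distance (same ranking as Euclidean distance)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def npn_order(records: List[Record]) -> List[int]:
    """Nearest-point-next ordering, returned as a list of record indices."""
    n = len(records)
    if n == 0:
        return []
    center = [sum(col) / n for col in zip(*records)]

    # First record: the one furthest from the centroid of the dataset.
    start = max(range(n), key=lambda i: sq_dist(records[i], center))
    order = [start]
    remaining = set(range(n)) - {start}

    # Each later record: the unplaced record nearest to the last one placed.
    while remaining:
        last = records[order[-1]]
        nxt = min(remaining, key=lambda i: (sq_dist(records[i], last), i))
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

Each of the n steps scans the remaining records once, which matches the O(n^2) running time stated above.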

III. Our Ordering and Improvement Heuristics 

We use two TSP-based tour construction heuristics, 
Greedy and Quick Boruvka (QB), and three tour 
improvement heuristics: 2-opt, 3-opt, and Lin-Kernighan 
(LK). The improvement heuristics are applied not only to the 
tours generated by Greedy and QB but also to those produced 
by the record ordering heuristics discussed in the previous 
section (NPN, MD-MHM, MDAV-MHM, and CBFS-MHM). 

The Greedy heuristic produces an ordering of records in 
several steps. First, the distances between every pair of 
records in the dataset are calculated and sorted in ascending 
order. The pair of records with the smallest distance is chosen 
and added to the path. The remaining pairs are then 
considered in order, and each is added to the path as long as it 
does not introduce a sub-tour or a vertex of degree greater 
than 2. The computational complexity of the Greedy heuristic 
is O(n^2 log n). 

In Quick Boruvka, the records are first sorted by one 
of the attributes to generate an initial ordering of records. 
For each record in the initial ordering, the distance between 
the current record and the other records in the dataset is 
computed. The pair of records with the minimum distance is 
added, subject to the same conditions outlined for the Greedy 
heuristic above. The QB heuristic also has complexity 
O(n^2 log n), but empirically runs slightly faster than the 
Greedy heuristic [3]. 

The 2-opt heuristic is a tour 
improvement heuristic; that is, it starts with a tour and 
iteratively attempts to improve it. It operates by first randomly 
selecting two non-adjacent edges (a total of four points) 
from the tour. If reconnecting the edges results in a shorter 
overall tour, the originally selected edges are replaced with 
the newly computed ones. This process continues until either 
no further exchange results in a shorter tour, or some other 
stopping criterion is met. In the 3-opt heuristic, three 
disconnected edges are broken, and every possible way of 
reconnecting the edges into a valid tour is examined. If a tour 
with shorter distance is found, the edges are swapped, and 
the process repeats until no more exchanges are possible or 
some other stopping criterion is met. 

The Lin-Kernighan (LK) 
heuristic is a tour improvement heuristic that is similar to 
2-opt and 3-opt. Like 2-opt and 3-opt, LK takes an existing tour 
and iteratively swaps edges to create a better tour. Lin and 
Kernighan [19] note that the choice of swapping two edges in 
2-opt or three edges in 3-opt is arbitrary, since it is possible that swapping 
more than three edges may actually result in a better tour. 
Therefore, LK attempts to build a series of sequential k-opt 
moves (sometimes referred to as swaps, flips, or exchanges), 
rather than a 2-opt or 3-opt move, that will result in an overall 
shorter tour. A good overview of these heuristics can be 
found in [22]. 

©2012 ACEEE 
DOI:01UCSI.03.01.11 
ACEEE Int. J. on Control System and Instrumentation, Vol. 03, No. 01, Feb 2012 
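The Greedy construction and the 2-opt improvement step can be sketched together in Python. This is an illustrative sketch under stated assumptions, not the authors' Java code: the record ordering is treated as an open path rather than a closed tour, the 2-opt pass scans all segment reversals systematically instead of picking edges at random, and path lengths are recomputed from scratch for clarity rather than speed. All function names are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence

Record = Sequence[float]


def dist(a: Record, b: Record) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def path_length(records: List[Record], order: List[int]) -> float:
    return sum(dist(records[a], records[b]) for a, b in zip(order, order[1:]))


def greedy_path(records: List[Record]) -> List[int]:
    """Greedy edge heuristic: take the shortest edges that keep a path."""
    n = len(records)
    if n == 0:
        return []
    parent = list(range(n))          # union-find detects sub-tours

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    adj: List[List[int]] = [[] for _ in range(n)]
    edges = sorted(combinations(range(n), 2),
                   key=lambda e: dist(records[e[0]], records[e[1]]))
    added = 0
    for u, v in edges:
        if added == n - 1:
            break
        # Reject edges that raise a degree above 2 or close a cycle.
        if len(adj[u]) < 2 and len(adj[v]) < 2 and find(u) != find(v):
            parent[find(u)] = find(v)
            adj[u].append(v)
            adj[v].append(u)
            added += 1

    # Walk the resulting path from one of its two endpoints.
    start = next(i for i in range(n) if len(adj[i]) <= 1)
    order, prev, cur = [start], -1, start
    while len(order) < n:
        nxt = next(w for w in adj[cur] if w != prev)
        order.append(nxt)
        prev, cur = cur, nxt
    return order


def two_opt(records: List[Record], order: List[int]) -> List[int]:
    """Reverse segments while doing so strictly shortens the path."""
    order = list(order)
    improved = True
    while improved:
        improved = False
        for i in range(len(order) - 1):
            for j in range(i + 1, len(order)):
                cand = order[:i] + order[i:j + 1][::-1] + order[j + 1:]
                if path_length(records, cand) < path_length(records, order) - 1e-12:
                    order, improved = cand, True
    return order
```

An ordering produced this way, for example `two_opt(records, greedy_path(records))`, would then be passed to the shortest-path step of Section II to obtain the groups.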

IV. Experimental Design 

A. Datasets 

We used three benchmark datasets that were used in 
previous studies [9] to evaluate our proposed methods. The 
first dataset, "Tarragona", consists of 834 records with 13 
variables. The second dataset, "EIA", consists of 4092 
records with 10 variables. Finally, the "Census" dataset 
consists of 1080 records from the U.S. census, with 13 
variables. Consistent with previous studies, the variables 
were normalized to ensure that results are not dominated by 
any single variable. 

B. Record Ordering Algorithms 

The record ordering algorithms used by Domingo-Ferrer 
et al. [9] are NPN, MD-MHM, CBFS-MHM, and MDAV-MHM. 
These algorithms were run on the three selected datasets. 
Additionally, two new record ordering algorithms introduced 
in this study - Greedy and Quick Boruvka - were also run. 
Three different algorithms for improving existing record 
orderings - 2-opt, 3-opt, and LK - were used with all tour 
construction heuristics. 

C. Comparisons 

The six microaggregation algorithms (i.e., NPN, MD- 
MHM, MDAV-MHM, CBFS-MHM, Greedy, and QB) were 
run on the three datasets (Census, Tarragona, and EIA) using 
five typical values of k (3, 4, 5, 6, and 10). These values of k 
are consistent with those used in [9]. The three tour 
improvement heuristics (2-opt, 3-opt, and LK) were then 
applied to the resulting record orderings. That is, for a 
given dataset and a given value of k, four different 
experiments were run using each record ordering heuristic 
(the original algorithm with no improvement, plus the three 
improvement heuristics). Consistent with previous studies, 
we report the normalized information loss IL = 100 * SSE/SST 
(with possible values between 0 and 100), where 

$$\mathrm{SST} = \sum_{i=1}^{g} \sum_{j=1}^{n_i} \lVert x_{ij} - \bar{x} \rVert^2,$$

$n_i$ is the number of records in the i-th group, and $\bar{x}$ is 
the centroid of the entire 
dataset. All of the algorithms were implemented in the Java 
programming language, using the Sun Java 6.0 SDK. All 
algorithms were run on a computer with a dual 3.0 GHz 
processor and 2GB of RAM, running the Linux operating 
system (Ubuntu 10.04). 
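The reported measure can be computed as in the following minimal Python sketch. The function name `information_loss` and the representation of groups as lists of record indices are assumptions for illustration; the sketch also assumes the records have already been normalized as described in Section IV-A.

```python
from typing import List, Sequence

Record = Sequence[float]


def centroid(records: List[Record]) -> List[float]:
    n = len(records)
    return [sum(col) / n for col in zip(*records)]


def sq_dist(a: Record, b: Record) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def information_loss(records: List[Record], groups: List[List[int]]) -> float:
    """Normalized information loss IL = 100 * SSE / SST.

    SSE sums squared distances to each group's own centroid; SST sums
    squared distances to the centroid of the entire dataset.
    """
    overall = centroid(records)
    sst = sum(sq_dist(r, overall) for r in records)
    if sst == 0:
        raise ValueError("SST is zero: all records are identical")

    sse = 0.0
    for g in groups:
        members = [records[i] for i in g]
        mean = centroid(members)
        sse += sum(sq_dist(r, mean) for r in members)
    return 100.0 * sse / sst
```

Placing every record in a single group gives IL = 100, because SSE then equals SST, while groups whose records are identical contribute nothing to the loss.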

V. Results 

This section details the results from applying the MD- 
MHM, MDAV-MHM, CBFS-MHM, NPN-MHM, Greedy- 
MHM, and QB-MHM algorithms, along with their 
corresponding 2-opt, 3-opt, and LK improvement heuristics, 
to the Census, Tarragona, and EIA datasets for five 
different values of k. The information loss under each 
method is reported. For each specified value of k, the lowest 
information loss obtained for each dataset is highlighted in 
bold to emphasize the best performing method. Table 1 shows 
that in all but one case (k=10 for Census), our tour 
improvement heuristics improved upon the results obtained 
using MD-MHM. Results for the MDAV-MHM and CBFS- 
MHM families of algorithms are very similar, as shown in Tables 
2 and 3. Reductions in information loss are often substantial. 
For example, with the EIA dataset, the MDAV-MHM+LK 
heuristic for k=10 resulted in an information loss of 1.968, 
which is 31% lower than the information loss (2.867) obtained 
using MDAV-MHM alone. 
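
The relative reduction quoted in this example follows directly from the two loss values given in the text:

$$\frac{2.867 - 1.968}{2.867} = \frac{0.899}{2.867} \approx 0.314,$$

that is, a reduction of about 31%.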

TABLE I. 

INFORMATION LOSS UNDER MD-MHM AND IMPROVEMENT HEURISTICS 

TABLE II. 

INFORMATION LOSS UNDER MDAV-MHM AND IMPROVEMENT HEURISTICS 

TABLE III. 

INFORMATION LOSS UNDER CBFS-MHM AND IMPROVEMENT HEURISTICS 

TABLE IV. 

INFORMATION LOSS UNDER NPN-MHM AND IMPROVEMENT HEURISTICS 

Census dataset
Method          k=3           k=4           k=5           k=6           k=10
NPN-MHM         6.211         3.959         11.054        13.231        20.227
NPN-MHM+2opt    5.321         7.334         9.115         [illegible]   17.146
NPN-MHM+3opt    5.300         7.061         8.557         9.940         15.424
NPN-MHM+LK      5.040         6.330         8.352         9.684         14.837

Tarragona dataset
Method          k=3           k=4           k=5           k=6           k=10
NPN-MHM         17.526        21.592        28.167        30.631        38.718
NPN-MHM+2opt    15.237        18.278        23.084        27.212        37.910
NPN-MHM+3opt    14.858        17.825        21.345        25.312        35.009
NPN-MHM+LK      15.221        18.410        22.680        [illegible]   34.038

EIA dataset
Method          k=3           k=4           k=5           k=6           k=10
NPN-MHM         0.514         0.735         0.992         1.273         2.719
NPN-MHM+2opt    0.427         0.595         0.887         1.179         2.217
NPN-MHM+3opt    0.417         0.607         0.895         1.252         2.119
NPN-MHM+LK      0.425         0.566         0.879         [illegible]   [illegible]


TABLE V. 

INFORMATION LOSS UNDER QB-MHM AND IMPROVEMENT HEURISTICS 

Tables 4, 5, and 6 present the results for the NPN-MHM, QB-MHM, 
and Greedy-MHM families of algorithms, respectively. 
Our tour improvement heuristics resulted in lower 
information loss in every instance. 



TABLE VI. 

INFORMATION LOSS UNDER GREEDY-MHM AND IMPROVEMENT HEURISTICS 

TABLE VII. 

BEST COMBINATION OF TOUR CONSTRUCTION AND IMPROVEMENT HEURISTICS 

EIA dataset
Method          k=3           k=4           k=5           k=6           k=10
MD-MHM          0.392         0.558         0.811         1.054         1.959
MDAV-MHM        0.390         0.562         0.826         1.133         [illegible]
CBFS-MHM        0.401         0.553         0.849         1.119         1.972
NPN-MHM         0.417         0.566         0.879         1.179         1.934
QB-MHM          0.392         0.559         0.805         1.023         1.932
Greedy-MHM      0.399         0.554         0.803         1.116         2.009
Best under      3opt          2opt          3opt          LK            LK



Table 7 summarizes the results from Tables 1-6 as follows: the 
best performing combination of tour construction and tour 
improvement heuristics for each dataset is identified for 
each value of k. The row labeled "Best under" identifies the 
specific improvement heuristic that produced the best results. 
For example, with the Census dataset, when k=3, the best 
results are obtained using a combination of Greedy-MHM 
and the LK tour improvement heuristic, with an information 
loss of 5.002%. While no tour construction heuristic 
dominates, out of 15 total instances, the best result was found 
using the LK improvement heuristic 7 times and using the 
3-opt improvement heuristic 6 times. Table 8 shows the 
difference between the lowest information loss obtained using 
extant microaggregation methods (the minimum value found 
from the MD, MDAV, CBFS, MD-MHM, CBFS-MHM, NPN- 
MHM, and MDAV-MHM heuristics) and the lowest obtained 
using the new methods described in this paper. Under our 
methods, information loss is decreased on average by 4%, 5.5%, and 
13.9% for the Census, Tarragona, and EIA 
datasets, respectively. In all but one instance (k=10 with the 
Census dataset), our proposed method outperformed existing 
methods and should hence be considered a promising 
microaggregation heuristic. 
TABLE VIII. 

RESULTS COMPARED TO EXTANT METHODS 

                k=3           k=4           k=5           k=6           k=10
Census
Previous Best   5.333         7.157         [missing]     [illegible]   [illegible]
New Best        5.002         [illegible]   [illegible]   [illegible]   [missing]
Improvement     6.2%          5.3%          4.3%          2.7%          [missing]

Tarragona
Previous Best   [illegible]   [illegible]   22.295        25.533        34.369
New Best        [missing]     17.633        20.732        24.658        [illegible]
Improvement     5.5%          4.8%          6.3%          3.3%          6.5%

EIA
Previous Best   0.453         0.654         0.335         [illegible]   [illegible]
New Best        0.390         0.553         0.803         1.023         1.959
Improvement     13.3%         [illegible]   13.5%         11.1%         [illegible]